ABSTRACT

This literature review summarizes what is currently
known about the agreement among six measures of writing skills. Three
of these methods involve the application of human judgment in scoring
or rating a piece of writing: holistic, analytical, and primary trait
scoring. Two methods involve anatomical, or taxonomic, analysis of a
piece of writing: computer analysis and syntactic analysis. The final
method involves the use of objective (usually multiple-choice) tests
of writing-related skills. The research on relationships among the
various measures of writing skills admits of relatively few
well-established generalizations. Relationships among some pairs of
measures have been well researched, while relationships among other
pairs of measures have been virtually untouched by empirical studies.
Aspect of National Assessment (NAEP) dealt with in this document:
Procedures (Scoring). (Author/PN)
Abstract
Although "writing skill" is often treated as a reasonably well-defined
trait, ability, or skill, there are a variety of seemingly disparate methods
all purporting to measure this skill. To what extent do these various methods
agree in their measurement of writing skill? This literature review
summarizes what is currently known about the agreement among six measures of
writing skill: holistic, analytic, primary trait, computer-based, syntactic
indices, and objective tests. Relationships among some pairs of measures have
been well researched, while relationships among other pairs of measures have
been virtually untouched by empirical studies.
RELATIONSHIPS AMONG MEASURES OF WRITING SKILL

INTRODUCTION 

There appears to be an assumption, both in popular discussions of the topic
and in the professional literature, that there is such a thing as
"writing skill" which is a reasonably unitary trait or ability (at least as
long as we confine the reference to expository forms of writing and exclude
such things as poetic writing). Despite this assumption, we have a variety of
methods for assessing writing skill, some of which appear on the surface to be
quite different from one another. In fact, it is not unusual to encounter
authors extolling one method while condemning another, as if the different
methods had nothing in common, i.e. as if they were measuring radically
different abilities or one was a "good" measure and the other a "bad"
measure.

One does wonder to what extent the various techniques purporting to
measure writing skill are all tapping the same function. Are the distinctions
among the measures merely physical and verbal, with the measures being roughly
equivalent in what they actually measure? If a curriculum program is declared
successful when one technique is used as the criterion measure, is it likely
that the same conclusion would have been reached had another measure been
used? If it is announced to the world that students' writing skill has
improved (or declined), must the announcement be qualified by a description of
how the skill was assessed?

The classical problem, of course, involves the relationship between "essay
tests" and "objective tests" of writing ability, a problem which is rooted in
the very foundations of what we now call the field of tests and measurements
and which served as a vehicle for many of the developments within that field.
However, "essay test vs. objective test" is an oversimplified formulation
of the complete question today, although still a very important segment of the
question. The last dozen or so years have witnessed the emergence of several
new methods purporting to measure writing skill, each being quite different in
character, for one reason or another, from either essay or objective tests, as
those tests have been defined traditionally. Hence, today a review of
relations between alternate measures of writing skill must go much beyond the
"objective and subjective testing techniques" covered by Huddleston's (1954)
thorough, scholarly review of more than 25 years ago.

We have identified six types of measures currently used to assess writing
skill. Three of these methods involve the application of human judgment in
scoring or rating a piece of writing: holistic, analytical, and primary trait
scoring. Two methods involve a kind of anatomical or taxonomic analysis of a
piece of writing: computer analysis and syntactic analysis. And the final
method involves the use of objective (usually multiple-choice) tests of
writing-related skills. More complete descriptions of each method are
provided in the next section to serve as a preface to the review of the
relationships among the six measures.

(Before proceeding with the review, it may be wise to distinguish between
the problem of essay vs. objective tests as measures of writing skill and the
problem of essay vs. objective tests as measures of knowledge or skill in some
content area such as history or mathematics. The latter issue, taken up in
works such as that of Coffman (1971), is not of concern to us in this review,
while the former issue is one of the central problems in our review.)

DEFINITIONS OF THE SIX MEASURES 

Although each of the methods of measuring writing ability has a number of
variations, each is also characterized by a basic theme or approach. We
introduce the review with a description of the basic theme for each measure, 
with some notes on common variations and typical applications. 



Holistic Scoring. In holistic scoring of essays, raters make a single,
overall judgment of the quality of a piece of writing. Exactly what is meant 
by "quality" may vary somewhat from one study to another, but most typically 
it is intended to include such factors as capitalization and punctuation, 
aptness of word choice, grammar, organization, spelling, sentence structure, 
and imagination; penmanship is usually excluded. The raters are instructed to 
weigh all of these factors together in roughly equal proportions to form their 
overall judgment of quality. Raters are also instructed to make no marks
(corrections, comments) on the paper and to move through each paper at a
fairly rapid pace; experienced raters move through papers in the 150-300 word 
range (the product of 20-40 minutes of writing) at approximately a minute per 
paper. 

The rater's final judgment is usually quantified on a point scale, ranging 
from low values (poor quality) to high values (high quality). There is no 
standard set of points to use for the scale; examples can be found of scales 
from 3 to 10 points.

It is highly recommended that the raters receive training in the use of
the holistic method. The training is designed to ensure that raters are
consistent over time and among one another; that one or two aspects of good
writing are not receiving undue weight; and that the rating proceeds at an
appropriate pace. Very often, the training involves the use of "anchor
points," i.e. papers which have been preselected by experienced raters to
illustrate various points along the score scale. By exposure to these anchor 
points, raters learn of the expected range of writing skill they will 
encounter and the degree of difference in skill represented by successive 
points along the score scale. Also, raters who cannot conform themselves to 
these anchor points after some amount of training may be eliminated from the 
pool of raters. 



The more formal, systematic applications of holistic scoring always use
trained raters. However, it must be admitted that many applications of
holistic scoring have not used training, or at least it is not apparent from
the description of the study whether there was any training and, if so, how
much.

Most applications of holistic scoring involve the use of more than one
rater per paper. There is a seemingly endless variety of ways to go beyond
the use of one rater. Sometimes two raters are used, and their independent
ratings are averaged or summed. (Hence, especially in the British literature,
holistic scoring is sometimes referred to as "double impression" scoring; see,
e.g. Wood and Quinn, 1976.) Sometimes two raters are used, and if their
ratings differ by a certain number of scale points, a third rater is
introduced. Sometimes each paper is read by three, four, or five raters. The
practice of using two raters for each paper but introducing a third rater to
resolve discrepancies seems to be gaining popularity, although by no means can
it be considered the standard methodology in this area.
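
The two-rater practice with a third reader for discrepancies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the `max_gap` threshold, and the rule of summing the third rating with the closer of the two original ratings are hypothetical, since the studies reviewed here differ in how discrepancies are defined and resolved.

```python
def resolve_holistic_score(first, second, third=None, max_gap=1):
    """Combine two independent holistic ratings into one summed score.

    max_gap is a hypothetical threshold: the largest difference (in scale
    points) tolerated before a third reading is required.
    """
    if abs(first - second) <= max_gap:
        # The two readers agree closely enough; sum their ratings.
        return first + second
    if third is None:
        raise ValueError("ratings differ by more than max_gap; a third rating is needed")
    # Assumed resolution rule: pair the third rating with whichever
    # original rating lies closer to it, and sum those two.
    closer = min((first, second), key=lambda rating: abs(rating - third))
    return third + closer
```

With ratings of 3 and 4 the function simply returns their sum; with ratings of 1 and 4 it refuses to produce a score until a third rating is supplied.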

In addition to the usual variations on holistic methodology mentioned 
above, there are some unusual variations, including paired comparisons (each 
essay compared with each other essay), Q-sorts, rankings, and so forth. For 
purposes of this review, all of these will be treated as applications of the 
holistic method, since they all follow the basic theme of making an overall 
judgment of the quality of writing.

The holistic method is one of the oldest procedures for assessing writing
skill. For many years, it was used by the College Board, then laid to rest
after much debate about its shortcomings (see Pearson, 1955), then resurrected
just recently with the renewed interest in writing skill at the college
level. The Hudelson English Composition Scale (Hudelson, 1921), a collection
of essays representing different scale values of writing, was published in
1921, the same year (and, incidentally, by the same publisher) as the Otis
Group Intelligence Scale. The preface to the test manual, extolling the
merits of a systematic and direct assessment of writing skill, reads as if it
were written just yesterday. One of the most noteworthy applications of
holistic scoring has been its use in each of three cycles of writing
assessment by the National Assessment of Educational Progress (see, e.g. NAEP,
1981). Following the practices established by NAEP, a number of states have
applied holistic scoring in statewide assessments (Fredrick, 1979).

Analytic Scoring. In analytical scoring, raters score each essay on
specific qualities, such as creativity, organization, mechanics, style, etc.
Like holistic scoring, analytic scoring depends on subjective judgments made
by raters, with variations in the number of raters used from one application
to another. Sometimes the scores on the separate factors or qualities simply
stand on their own, while other times the separate scores are also averaged or
summed to yield a total score.

There is clearly no consensus regarding how many factors should be used in 
analytical scales. Examples can be found of scales with just two factors 
(e.g. mechanics and creativity) and of scales with as many as 18 factors.
Most analytical scales, however, yield about five or six scores. One of the
best-known analytical instruments is the Composition Evaluation Scales
created by Diederich, French and Carlton. Once used by the College Board,
analytical scoring was discontinued, largely because the method did not prove
more reliable than the more efficient holistic method. Diederich's (1974)
highly readable Measuring Growth in English, often incorrectly cited as an
example of holistic scoring, actually uses an analytical scale.

It is worth noting that although analytical scoring is routinely listed as
one of the major approaches to the assessment of writing, actual applications
of it in either the research literature or in large scale assessments (e.g.
state programs) are rather rare.

Primary Trait Scoring. In primary trait scoring, raters judge to what
extent a sample of writing contains the "primary trait" that it must have in
order to accomplish its purpose. Writing tasks are carefully constructed so
that the purpose and audience for the piece of writing are precisely defined.
Students' essays (or other written products, perhaps just a note) are then
judged according to how well their writing achieves the defined purpose, i.e.
exhibits the primary trait. For instance, if the dominant purpose of an
exercise is explanatory, the primary trait will be explanation through
selection and ordering of details. In a typical application, essays are
judged by two raters on a 1-4 scale, with "1" for essays which show little or
no evidence of the primary trait, "2" for essays showing minimal evidence of
the primary trait, "3" for essays demonstrating competence with the primary
trait, and "4" for essays demonstrating excellence in the primary trait.
Precisely defined scoring criteria for each score point are outlined and used
for each writing task.

Essays can be rescored for secondary, tertiary, or presumably any lower
order trait. Such traits consist of any well-defined rubric for viewing the
piece of writing other than the primary trait. For example, a letter, after
being scored for the primary trait of whether or not the intended message was
conveyed, could be scored for the secondary trait of appropriateness of letter
format.

The primary trait scoring method was developed in the late '70s for the
National Assessment of Educational Progress (NAEP) in response to NAEP's need
to explain more fully the writing tasks that school children were able to do.
It is now more prominent than holistic scoring in NAEP's writing assessments
(see e.g. NAEP, 1981) and has also been adopted in many statewide writing
assessments (Fredrick, 1979).

Syntactic Scoring. This approach to writing assessment is based on the
analysis of grammatical forms and syntactical structures of a student's
essay. Hunt's (1965) research, which revealed the ways in which children's
writing becomes more syntactically complex as they advance through the grades,
laid the groundwork for syntactical complexity analysis. The three major
indices of syntactical complexity are words per T-unit (a subject and verb and
all its surrounding modifiers), words per clause, and clauses per T-unit. In
syntactical scoring, scorers segment essays into T-units and conduct other
types of frequency counts of particular syntactical structures that have been
shown to change as students become older. Widely used in sentence combining
research, syntactical scoring has been added to the most recent NAEP writing
assessment.
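
As a minimal sketch of how the three indices relate to one another, assume that a scorer has already segmented an essay and tallied its words, T-units, and clauses by hand. The function and argument names below are ours, chosen for illustration, and the counts in the usage note are hypothetical.

```python
def syntactic_indices(words, t_units, clauses):
    """Compute the three major syntactic complexity indices from raw counts."""
    if t_units <= 0 or clauses <= 0:
        raise ValueError("t_units and clauses must be positive counts")
    return {
        "words_per_t_unit": words / t_units,
        "words_per_clause": words / clauses,
        "clauses_per_t_unit": clauses / t_units,
    }
```

For an essay of 200 words segmented into 20 T-units and 25 clauses, `syntactic_indices(200, 20, 25)` gives 10.0 words per T-unit, 8.0 words per clause, and 1.25 clauses per T-unit. Note that the first index is simply the product of the other two.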

Syntactic analysis is very widely used by researchers whose training has
been primarily in the language arts field and is infrequently used by those
whose training has been in the measurement area. Hence, for example,
syntactic analysis is used routinely in articles appearing in the Journal for
Research in the Teaching of English, but almost never appears in the Journal
of Educational Measurement.

Computer Scoring. Computer scoring of essays refers to analysis of
variables within written discourse that are amenable to mechanical counts by a
computer. Average word length, number and types of punctuation, sentence
length, and other such features are machine counted. In this method of
scoring, the essays are typed into the computer and a program to analyze
countable features is run. Ordinarily, machine countable features of the
writing which correlate most highly with judgments of writing quality (derived
by holistic or analytical scoring) are compiled into some type of
computer-generated score. Pioneering work was conducted in this area by Ellis
Page (see Page, 1967, 1968; Page and Paulus, 1968), and followed up by
Slotnick (1971, 1972, 1974) and Slotnick, Knapp and Bussell (1971).
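
To give a concrete sense of what "machine countable" means, the sketch below tallies three of the features named above. It is not a reconstruction of Page's program: the function name is hypothetical, and the regular expressions are crude stand-ins for word and sentence segmentation.

```python
import re

def countable_features(text):
    """Count a few surface features of an essay (illustrative only)."""
    words = re.findall(r"[A-Za-z']+", text)
    # Treat any run of ., ! or ? as a sentence boundary (a simplification).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        "comma_count": text.count(","),
    }
```

In a scoring system of the kind described, counts like these would then be weighted according to their correlation with human judgments of quality; that weighting step is not shown here.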

Objective Tests. A final method used to assess writing skill is provided
by objective, standardized tests of the multiple-choice variety. Some of
these tests, particularly ones designed for use with high school and college
students, are designed specifically to assess writing skill; examples are the
Test of Standard Written English, the Missouri College English Test, and the
College Placement English Test. Such tests are usually formulated in terms of
some logical analysis of the writing act or writing subskills, and validated
in terms of the test score correlation with judgments of writing quality as
represented by holistic scores on essays or grades in writing courses.

Other objective tests of language skills, particularly those included in
standardized achievement batteries for use at the elementary school level, are
designed to have content validity for the language arts curriculum. That
curriculum, it must be noted, includes much besides writing skill. Hence,
elementary school language tests often include items on library cards, types
of reference works, alphabetizing, poetry, listening skills, and so forth, in
addition to such presumably writing-related skills as spelling, grammar, and
punctuation. Sometimes items in these different areas yield separate scores,
while at other times they are simply lumped together in one Total Language
score.

Some of the recent literature on the assessment of writing refers to 
objective tests as "indirect" measures of writing skill, while classifying 
such methods as holistic, analytical, and primary trait scoring as "direct"
measures. The usage is unfortunate and misleading. It is true that objective
tests yield an indirect measure of writing skill, but it is not true that
holistic scoring (or any of the other judgment-based scores) yields a direct
measure of writing skill. In fact, we probably do not have direct measures of
any constructs such as writing skill or reading ability or the myriad of
other traits, skills, and abilities of interest in educational and
psychological measurement.

"Other" Measures. We have identified six major methods of assessing
writing skill extant in the research literature. There are, of course, an
almost limitless number of other ways of looking at writing skill, including
rhetorical analysis or literary criticism, error counts (spelling,
subject-verb agreement, etc.), and the infamous "red-pencil-in-the-margin" of
the English teacher. Error counts are sometimes included in lists of writing
measures (see, e.g. "writing mechanics" in Spandel and Stiggins, 1980) and
occasionally used in formal studies. But all of these "other methods" are
used too infrequently in the research literature to warrant inclusion in our
list of major methods of measuring writing skill.

RELIABILITY

Our interpretation of data on relationships among the six measures of
writing skill, which is the main focus of attention in this paper, will be
influenced by the reliability of each method. This is the classical problem
of attenuation due to unreliability. Ideally, each study to be considered
later would address this issue, providing information which would allow one to
estimate the disattenuated relationship. Unfortunately, this is not always
the case: in some studies, the relationships among measures were of only
incidental concern so that the reliability issue was not explored, while in
other studies the authors seem oblivious to the attenuation problem. Hence,
in this section, we attempt to provide a general review of what is known about
the reliability of each method, while acknowledging that these general
findings may not apply to each study taken up later.
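
The disattenuated relationship is conventionally estimated with the classical correction for attenuation, which divides the observed correlation between two measures by the square root of the product of their reliabilities. The sketch below assumes that standard formula; the function name and the numerical values in the usage note are hypothetical and are not drawn from any study reviewed here.

```python
import math

def disattenuate(r_xy, r_xx, r_yy):
    """Classical correction for attenuation.

    r_xy: observed correlation between measures X and Y
    r_xx, r_yy: reliability coefficients of X and Y
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)
```

For example, an observed correlation of .56 between two measures with reliabilities of .64 and .49 corresponds to a disattenuated correlation of about 1.0: the modest observed value would be entirely attributable to unreliability rather than to any difference in what the two measures tap.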

Four types of reliability determinations will enter into the discussion.
First, scorer reliability will be a prominent issue for those scores which
involve ratings or some other type of human judgment. Hence, scorer
reliability needs to be determined for holistic, analytical, primary trait,
and syntactic maturity scores, the latter because, while some of the counts
are quite mechanical, other counts do involve a judgment. For practical
purposes, objective test scores and computer-based scores may be considered to
have perfect scorer reliability.

There are two subcategories of scorer reliability to consider. First,
there is intra-scorer reliability: the consistency with which one rater
scores or judges a given set of papers on different occasions or under varying
conditions. Second, there is inter-scorer reliability: the consistency with
which different raters score or judge a given set of papers. Most
investigations of scorer reliability deal with the latter issue.
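
Inter-scorer reliability is most often reported as a correlation between the ratings two scorers assign to the same set of papers. The following sketch computes a Pearson product-moment correlation from two lists of ratings; the function name is ours, and any ratings passed to it in illustration are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 papers")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a rater gave every paper the same score")
    return sxy / math.sqrt(sxx * syy)
```

The same function serves for intra-scorer reliability if the two lists hold one rater's scores for the same papers on two occasions.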

The second major type of reliability to be determined may be referred to
as alternate-form reliability. The terminology here is derived from usage
within the area of objective tests, where the meaning of alternate forms is
well-established. The analogous case for all of the other types of scores
(all of which depend upon examinees producing a piece of writing) involves the
consistency between scores derived from two different pieces of writing which
are judged to be roughly equivalent tasks (e.g. two impromptu, 20-minute
essays of an argumentative nature). In contrast, we may refer to cross-task
reliability, which involves consistency between scores derived from two pieces
of writing which are judged to be nonequivalent tasks (e.g. writing a short
thank you note vs. writing a lengthy research paper).

Finally, there are the various coefficients of internal consistency
reliability applied to objective tests. Of course, from a theoretical
perspective, these indices of reliability can be thought of as specific
applications of alternate form reliabilities (or vice versa). They could be
applied with some ease to computer and syntactic scoring, and possibly to
holistic, analytic and primary trait scoring, although the application in the
latter cases might be strained beyond intelligible limits. However, as a
practical matter, internal consistency reliability is used almost exclusively
with objective tests and is reported separately from alternate form
reliability for such tests.

True test-retest reliabilities are rarely reported for any of the
measures. Even when authors do refer to test-retest reliability, they are
usually using alternate form data, i.e. data based on two different writing
topics.

Reliability of Holistic Scoring. One principle that has been established
for a number of years is that student writing can indeed be reliably judged. 
Many studies have found that when proper conditions are met, interscorer 
reliability of .80 or above can be achieved (Cooper and Odell, 1977; 
Diederich, 1974; Hogan and Mishler, 1980; Page, 1968). Most researchers agree 
that this level of reliability is possible, despite a widespread notion to the 
contrary among laypersons. 

On the question of alternate form reliability, opinion is somewhat more 
divided. Anderson (1960) notes that "the discrepancy between tests 
[holistically scored essays] is evidence of the unrepresentative character of 
a solitary essay. The significant variability among testing occasions is 
evidence of fluctuation in the function underlying composition ability" 
(p. 90). The Anderson study employed analysis of variance rather than a 
correlational methodology for studying reliability. Braddock, Lloyd-Jones,
and Schoer (1963) cite Kincaid (1953) as also having demonstrated substantial
fluctuation in writing scores across occasions, lending support to Anderson's
contention that the alternate form reliability of holistically-scored essays
is unacceptably low.

Hogan and Mishler (1980) report a correlation of .71 between two
holistically-scored essays written on two occasions by Grade 8 students, a
finding which supports Diederich's research with high school age or older
students. However, Hogan and Mishler found a slightly higher correlation of
.81 at the Grade 3 level. Thus, alternate form reliability of holistic
scoring appears to be noticeably lower than interscorer reliability, at least
among older students.

How stable is the holistic score across writing tasks (cross-task 
reliability)? The topic, the mode of discourse, the time allotted for 
writing, the intended audience, and the instructions given to students are a 
few of the task variables that might presumably be investigated. Braddock et
al. (1963) cite several studies that suggest that the topic students write on
influences the quality of writing produced and the resulting holistic score.
Braddock et al. suggest that mode of discourse will have a substantial effect
on the holistic scores; they also note the need for research on the optimum
time needed for writing during testing. Overall, research has not been 
definitive on matters relating to the stability of writing across tasks, as 
measured by the holistic scoring method. 

Reliability of Computer Scoring. In computer scoring, of course, the
question of "scorer" reliability is not a problem since we are not dealing
with subjective human judgments; hence, scorer reliability may be considered
perfect. Page and Paulus (1968) have investigated the alternate form
reliability of each of the 30 variables in their scoring system. In
correlating the variables for Essay C and Essay D (written about a month
apart), Page and Paulus report correlations ranging from -.02 to .65. Some of
the most unreliable variables were number of slashes (-.02), presence of a
title on the essay (.05), number of "Type B" declarative sentences (.09) and
the number of relative pronouns (.17). Among the variables with the highest
reliability were average sentence length (.63), number of commas (.61),
average word length (.62), standard deviation of word length (.61), and number
of common words on the Dale list (.65). Thus the alternate form reliability
of the thirty computer countable elements used in Page's study varied
considerably, although the variables of ultimately greatest interest (as we
shall presently see) tended to have reliabilities of .60 - .65.

Reliability of Analytic Scoring. Several studies have addressed the
question of interscorer reliability when analytical scoring of essays is
used. Some studies have contrasted the interscorer reliability of analytical
scoring with that of the faster holistic method and have come to the
conclusion that the interscorer reliability of each method is about the same
(Coward, 1952). A more recent investigation (Follman and Anderson, 1967)
compared four analytical methods (The Diederich Rating Scale, The California
Essay Scale, The Cleveland Composition Rating Scale, The Follman English
Mechanics Guide) and a method similar to the holistic method, which was dubbed
Everyman's Scale. Resulting average interscorer reliability coefficients
ranged from .95 using the Follman English Mechanics Guide to .81 using the
Cleveland Composition Rating Scale. (Reliabilities for separate subscales
within the analytical scales were not reported.) Reliability using the
holistic method was .95. These results show that similar levels of
interscorer reliability (.80 or greater) can be attained with either holistic
or analytic scoring.

In fact, the interscorer reliability coefficients reported for five
different analytical scales listed in Measures for Research and Evaluation in
the Language Arts (Fagan, Cooper and Jensen, 1975), a compilation of
unpublished instruments, are above .80. For Diederich, French and Carlton's
E.T.S. Composition Evaluation Scales, an interscorer reliability of .90 is
noted. Other measures described in Measures for Research and Evaluation in
the Language Arts report similarly high interscorer reliabilities of .83 for
the Glazer Narrative Composition Scale, .97 for the Sager Writing Scale, .73
for the Literary Scale, and 67-100 percent agreement for the Schroeder
Composition Scale.

No information [...] scorer reliability could be found for analytic scores.

Reliability of Objective Tests. As for computer scoring, scorer
reliability may be considered virtually perfect for objective tests. For
regularly published objective tests, i.e. ones which have undergone the
customary round of prepublication research, reliability is very much a
function of test length. For objective tests with about 50 or 60 items,
alternate form reliabilities are usually in the range of .85-.90 and various
internal consistency measures of reliability in the range of .90-.95. For
objective tests we do not really have an analog for cross-task reliability,
unless entirely separate tests of English skills (say the Missouri College
English Test vs. the College Board's Test of Standard Written English) are
thought to fill this gap. Such alternate measures usually correlate about
.70-.85. Test manuals, at least for the widely used published tests, are
usually chock-full of reliability data, so we have not bothered to cite data
for specific tests here.
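
The dependence of reliability on test length is usually expressed through the Spearman-Brown formula of classical test theory. The review does not name that formula, so its use here is our assumption; the sketch simply shows how the projected reliability changes when a test is lengthened or shortened by a given factor with comparable items.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor.

    Assumes the added (or removed) items are parallel to the existing ones.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)
```

Doubling a test whose reliability is .50, for instance, projects to a reliability of about .67, while a length factor of 1 leaves the coefficient unchanged.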

Reliability of Syntactic Complexity Scoring. The interscorer reliability
achieved when an essay is segmented into T-units consistently falls above
.90. Researchers have reported that trained scorers can analyze essays for
T-units with little or no disagreement (O'Donnell, Griffin and Norris, 1967;
Crowhurst and Piche, 1979). Crowhurst (1980) reported interscorer reliability
coefficients ranging from .97 to .99 calculated after training and before
scoring.

Alternate form reliability of the major syntactical indices (T-unit
length, clause length, clauses per T-unit) has not been well researched.
Witte and Davis (1980) have noted O'Donnell's (1976) statement: "...there
are no data to show how consistently these indices measure the structural
complexity of an individual student's writing in various situations" (p. 33).
Witte and Davis (1980), in what is apparently the only study of alternate form
reliability of T-unit measures, found that T-unit length was not a stable
individual trait, even within the same mode of discourse. They regard their
finding as "tentative and inconclusive" and urge further research.

The stability of syntactic complexity measures across tasks has been the
subject of some research that focuses on how mode of discourse influences
syntactical complexity. San Jose (1972) found that mean T-unit length
differed significantly across four modes of discourse. Crowhurst and Piche
(1979), Crowhurst (1980) and several others have found that T-unit length
produced in an argumentative essay is greater than that produced in
narration. Witte and Davis (1980) also found that T-unit length was not
stable across the modes of description and narration. The question of
stability of T-unit length, particularly within a mode of discourse, bears
further investigation. However, most of this research shows only that
different tasks yield differences in average scores for syntactic measures;
the issue of relative order is skirted and, hence, reliability, in the
psychometric sense, is not determined.
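
The distinction between a difference in averages and a change in relative order can be made concrete with a small hypothetical data set. In the sketch below, the T-unit lengths are invented for illustration: every student writes longer T-units in argument than in narration, yet the ordering of students is identical in the two modes.

```python
from statistics import mean

def rank_order(scores):
    """Return student positions sorted from lowest to highest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i])

# Hypothetical mean T-unit lengths for the same four students in two modes.
narration = [8.0, 9.5, 11.0, 12.5]
argument = [11.0, 12.5, 14.0, 15.5]

# The task effect on the group average.
mean_difference = mean(argument) - mean(narration)
# Whether students keep the same standing relative to one another.
same_relative_order = rank_order(narration) == rank_order(argument)
```

Here the mode of discourse shifts the average by three words per T-unit while the relative standing of the students is perfectly preserved, so a significant mean difference by itself says nothing about cross-task reliability.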

Fredrick (1970) determined a number of syntactic indices for themes
written by eighth grade students over a six week period, then
correlated the indices from the first 3-week period and second 3-week period
and from odd and even pages. He found "clause length, clauses per T-unit,
T-unit length, T-units per sentence, and sentence length correlated .48, .2?,
.56, .48, and .62, respectively, between first half and second half, and .69,
.74, .65, and .77 between odd and even page samples" (p. 126). It should
be noted that many of the student essays used in research and in school
evaluation programs are much shorter than the 1000 or 500 word samples used in
this study.

Reliability of Primary Trait Scoring. Mullis (1980) reports that strong
interscorer reliability exists in primary trait scoring. Although no
correlations are reported, percentages of essays on which the first and second
readers agreed ranged from 91 to 96 percent for various groups of essays
scored with the primary trait method (NAEP, 1961). Studies of alternate form
or cross-task reliability for primary trait scores are not available.
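Because the primary trait evidence is reported as percentage agreement rather than as a correlation, it may help to show how such a figure is obtained. The following is a minimal sketch under assumed data: the ten pairs of reader scores are hypothetical, and `percent_agreement` is an illustrative name, not a procedure documented by NAEP.

```python
def percent_agreement(first_reader, second_reader):
    """Percentage of essays given the identical score by both readers."""
    if len(first_reader) != len(second_reader):
        raise ValueError("both readers must score the same set of essays")
    matches = sum(1 for a, b in zip(first_reader, second_reader) if a == b)
    return 100.0 * matches / len(first_reader)


# Hypothetical primary trait scores (1-4) for ten essays (assumed data).
first = [1, 2, 2, 3, 4, 2, 1, 3, 3, 2]
second = [1, 2, 2, 3, 4, 2, 1, 3, 2, 2]

print(percent_agreement(first, second))
```

The two readers differ on one essay of ten, so the function returns 90.0. Exact agreement of this kind is not interchangeable with a correlation coefficient, which is one reason the figures above are hard to compare with the coefficients reported for other methods.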

Summary. At the risk of jeopardizing our professional reputations as well
as any claims we may have to sanity, we venture the following summaries of
what is presently known about the various types of reliability for each of the
six methods of measuring writing skill. Table 1 indicates our judgment of how
much information seems to be available regarding each type of reliability for
each assessment method, while Table 2 indicates a generalized average or
typical coefficient for each type of reliability for each method, at least for
those instances where the amount of information allows an estimate.



Table 1. Summary of How Much Information is Available about Reliability
         of Each Assessment Method

Assessment           Scorer          Alternate   Cross-    Internal
Method           Intra-   Inter-     Form        task      Consistency

Holistic         much     much       much        little    NA
Analytical       some     some       little      little    NA
Primary Trait    none     some       none        none      NA
Computer         NA       NA         some        none      none
Syntactic        much     much       little      little    little
Objective Test   NA       NA         much        much      much

Table 2. Summary of Estimated Typical Reliability Coefficients
         for Each Assessment Method

Assessment           Scorer          Alternate   Cross-    Internal
Method           Intra-   Inter-     Form        task      Consistency

Holistic         .90      .85        .60         ?         -
Analytical       .90      .85        .60         ?         -
Primary Trait    .95      .90        ?           ?         -
Computer         .99      .99        .65         ?         ?
Syntactic        .95      .95        ?           ?         ?
Objective Test   .99      .99        .90         .80       .90

Despite the near universal agreement about the importance of determining
reliability for any measure, it seems apparent that there is still much work
to be done on the reliability issue for these measures of writing skill.





RELATIONSHIPS AMONG THE MEASURES 

With six different methods of measuring writing skill, we obviously have
15 possible pairings of the methods; for each pairing the question of
equivalence can be raised. It is immediately apparent that some of the
relationships have been studied repeatedly, while others have not been studied
at all, at least as defined by the published literature. For example, the
relationship between holistic scores and objective tests has been studied
often, whereas the relationship of primary trait scores to any of the other
methods remains a mystery. In the following sections, each relation which has
been the subject of one or more studies will be treated.
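The count of 15 pairings follows from choosing two methods out of six without regard to order. As a small check, the pairings can be enumerated directly; the method labels are those used in this review, and the variable names are only illustrative.

```python
from itertools import combinations

# The six methods of measuring writing skill covered in this review.
methods = [
    "holistic",
    "analytic",
    "primary trait",
    "computer",
    "syntactic",
    "objective test",
]

# Every unordered pair of distinct methods.
pairings = list(combinations(methods, 2))

for first, second in pairings:
    print(f"{first} vs. {second}")
print(len(pairings))
```

Five of the 15 pairings involve primary trait scoring, which is the method whose relationship to the others has gone unstudied.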

Holistic vs. Computer Scoring. Page and Paulus (1968) correlated 30
computer countable variables called "proxes" with ratings of overall quality
and reported a multiple correlation of .71. The proxes included such computer
countable variables as average sentence length, frequency of various types of
punctuation, frequency of spelling errors, standard deviation of word length
and length of essay. Page and Paulus reported moderate correlations for
several of the proxes, after using the proxes to predict ratings on two
separate essays. Average word length (r=.37 for essay C and .51 for essay D),
standard deviation of word length (r=.45 for essay C and .53 for essay D),
number of commas (r=.36 for essay C and .34 for essay D), proportion of common
words on the Dale list (r=-.37 for essay C and -.48 for essay D), and essay
length (r=.25 for essay C and .32 for essay D) were among the best predictors
of the holistic rating. Average sentence length emerged as an additional
strong predictor when the reliability of each prox was taken into account.

Slotnick (1971, 1972, 1974) and Slotnick, Knapp and Bussell (1971)
conducted a series of studies that built on the work of Page by expanding the
computer program to include 59 indicators (in contrast to Page's 30).
Vocabulary, subordination, and prepositions were computer-analyzed somewhat
differently than in the Page study. In one study of college freshman writing,
Slotnick et al. (1971) report that five of the 59 indicators were
significantly correlated with the holistic essay score: number of sentences
(r=.379), number of logical prepositions (r=.308), number of rare words
(r=.475), number of all logical prepositions, and number of quotes (r=.312).
Taking the four strongest indicators together, Slotnick et al. reported a
multiple correlation of .66 between the computer-generated score and the
holistic essay score. A subsequent letter writing study of two groups of
adults (Slotnick, 1974) revealed the remarkably high multiple correlations of
.866 and .781 when the three indicators of number of different words in the
essay, mean word length, and number of misspellings were used to predict the
holistic score. Thus Slotnick's overall results were similar to Page's.

Hogan and Sugano (1977) also developed a list of 30 proxes that built on
the work of Page and Slotnick. They explored such proxes as vowels per word,
specificity, and copulatives, in addition to the more common proxes (total
words, average word length, etc.). Using 60 college freshman test essays rated
holistically (high, middle and low), they obtained a multiple correlation of
.65 with the proxes. Total words (r=.55), average word length (r=.20),
standard deviation of word length (r=.31), number of commas per word (r=.40),
and vowels per word (r=.22) were a few of the proxes that correlated
positively with the holistic ratings.

Computer analysis of essays or, more precisely, computer-generated scores
have yielded correlations with holistic scores in the range of .65-.86. These
results, remarkably consistent across a number of different studies, seem
surprisingly high; and, perhaps even more surprising is the fact that there
appears to be little or no contemporary effort in this area of research.

Holistic vs. Analytic. Some research in this area has attempted to
identify factors important in contributing to the holistic score. Diederich
(1974) refers to the factor analysis that he, John French and Sydell Carlton
conducted in 1961 on the ratings of 300 essays written by college freshmen.
He identified the five factors of ideas, organization, wording, flavor, and
mechanics. These factors explained 43 percent of the variance in essay
scores. The holistic scoring in this study was a sorting of the essays into
nine piles with no training of raters.

Few other studies as sophisticated as Diederich's exist. In Measures for
Research and Evaluation in the Language Arts (Fagan et al., 1975), a
compilation of writing assessment instruments which includes many analytic
scales, only one analytical instrument was validated by a correlational
study. The Glazer Narrative Composition Scale (a set of 18 scales to assess
the quality of young children's narrative essays) total score was found to
correlate .80 with scores produced after a quick impression Q-sort. None of
the 18 scale scores were individually correlated with a holistic score.

Objective Test Scores vs. Holistic Scoring. Most research investigating
the relationship of holistic essay scores to objective test scores has
revealed substantial although far from perfect correlations between objective
test scores and holistic ratings. Correlations generally fall in the .55-.70
range.

Research With College Students. Most research conducted on the issue of
essay scores vs. objective test scores has been related to college selection
and/or placement and hence has dealt with the higher developmental levels of
writing skill. The Educational Testing Service and the College Entrance
Examination Board have been the major contributors to research on this
question, which has been investigated fairly thoroughly with upper secondary
and college students.

The widely cited study by Godshalk, Swineford, and Coffman (1966), using a
largely college bound group of high school juniors and seniors, reported
correlations generally in the .30's between several objective measures and
single essays rated by two or three readers. But correlations of .57 to .71
were obtained between objective measures and an elaborately constructed essay
score (four samples, each scored by five readers). Huddleston (1954) found a
fairly high correlation between the SAT verbal subtest and instructors'
ratings of student writing ability (r=.76), showing the objective test to be a
better predictor of instructors' ratings than is the essay test. Pearson
(1955) also reported a higher correlation between teachers' ratings of ability
and the Scholastic Aptitude Test (r=.65) than between the ratings and an essay
test (r=.51). Breland and Gaynor (1979) reported correlations between single
essays and single scores on the objective (multiple-choice) Test of Standard
Written English of .56-.63; see also Breland (1977). Similar results were
reported by Wood and Quinn (1976).

Research With Elementary Children. At least three studies have researched
the relationship between objective test scores and holistic essay scores among
younger children. Ondrasik, Crocker, and Lamme (1979) compared 138 fourth
graders' performance on four subtests of the Metropolitan Achievement Test
with their performance on two holistically scored essays, one that involved
fiction-writing and one that involved a factual report task. They found
moderate correlations between the holistic rating and the Word Knowledge
subtest (r=.[?]5), the Reading Comprehension subtest (r=.52), the Spelling
subtest (r=.43), and the Language Arts subtest (r=.30). They concluded that
the strength of the relationship observed was insufficient to suggest that
standardized tests can be used to replace actual measures of writing.

Hogan and Mishler (1980) found somewhat higher correlations between
Metropolitan Achievement Test subtests and holistically scored essays of third
and eighth graders. They reported correlations generally in the .5[?]-.75 range
for essay scores correlated with Punctuation and Capitalization, Listening
Comprehension, Usage, Grammar and Syntax, Language Study Skills, and
Spelling. Correlating the total score for performance on all subtests with
the holistic score produced correlations of .69-.83. Another Language subtest
(part of a battery of Reading, Science, Social Studies and Math subtests)
correlated .66 at grade 3 and .71 at grade 8 with holistic essay scores. Thus
Hogan and Mishler found correlations of the same general magnitude as those
reported in studies of college-bound students.

On the other hand, Moss, Cole and Khampalikit (1982) reported a somewhat
lower correlation between the Language Test of the 3Rs Achievement Test and
holistic essay scores at grade 4 (r=.20). While they reported correlations
between the objective test score and the holistic essay score of .67 for grade
7 and .75 for grade 10, the lower correlation they found at grade 4 led them
to conclude that "our data suggest lower relationships at the elementary
school level," in contrast to the other two studies.

In sum, the relationship between objective test scores and holistically
scored essays has been reasonably well researched at the college and
precollege level: correlations have revealed a substantial relationship. At
least two studies have replicated these findings at the elementary level,
while the other has suggested a weaker relationship for younger students'
writing.

Holistic Scoring vs. Syntactical Maturity Scoring. The relationship
between quality of writing and the syntactical maturity of writing has been
studied several times at the college, high school and elementary level. In
general, most of these studies have found little or no relationship between
quality of student writing and the syntactical complexity of the writing.
Although some studies have reported gains in both syntactical maturity and
quality after a particular treatment (e.g., practice in sentence-combining),
these studies have not shown a high or even moderate correlation between the
two measures. It should be noted that the methodology used in many studies
within this category involves contrasting (then testing for significance) the
syntactic indices characteristic of high-rated and low-rated essays; hence,
one must often infer the strength of the relationship between holistic scores
and syntactic indices from mean differences or t-values.
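One common way to make the inference from a t-value to a correlation is the point-biserial conversion r = t / sqrt(t^2 + df), which treats the high-rated versus low-rated contrast as a two-valued variable. The sketch below applies that conversion to hypothetical values; the studies reviewed here do not necessarily report their results in a form that permits it, and comparing only extreme groups can overstate the relationship in the full range of essays.

```python
from math import sqrt


def r_from_t(t_value, n_high, n_low):
    """Point-biserial r implied by an independent-samples t statistic.

    Assumes two independent groups (high-rated and low-rated essays)
    and the usual pooled-variance t with n_high + n_low - 2 degrees
    of freedom.
    """
    df = n_high + n_low - 2
    return t_value / sqrt(t_value ** 2 + df)


# Hypothetical contrast: 30 high-rated and 30 low-rated essays, t = 2.0.
r = r_from_t(2.0, 30, 30)
print(round(r, 2))
```

With these assumed numbers, a difference near the conventional .05 significance threshold corresponds to a correlation of only about .25, which is why a "significant" contrast between groups need not imply a strong relationship.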

Research With College Students. At least two studies have reported simple
correlations between college freshmen's scores on holistically scored writing
samples and several of the commonly used indices of syntactic development.
The sets of correlations produced in each study are remarkably low, with each
study turning up almost identical correlations between quality ratings and
syntactical variables. Nold and Freedman (1977) attempted to determine which
of the various syntactical measures might predict the holistic scores of 22
Stanford freshmen, each of whom wrote four essays. Using the work of Golub
and Fredrick (1970), Nold and Freedman correlated 17 syntactical maturity
variables with quality ratings of trained raters. They found a correlation of
-.08 between words per T-unit and quality, -.09 between words per main clause
and quality, -.06 between words per subordinate clause and quality, and -.03
between subordinate clauses per T-unit and quality. (These correlations and
others from the Nold and Freedman study should be read as positive correlations
because a low essay score indicated high quality. Each rater used a 1-4 scale
with 1 being the highest.) The variables that correlated most highly with essay
quality were overall length (r=-.57), percentage of words in final free
modifiers (r=-.42), percentage of finite verbs which have modal auxiliaries
(r=[?].38) and percentage of verbs which show be or have as auxiliaries (r=.32).
Nold and Freedman concluded that "words per T and other standard developmental
measures are not useful in predicting perceptions of quality on the college
level" (p. 174).
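The sign convention noted for the Nold and Freedman ratings (1 = highest quality) can be made concrete with a short sketch. Reflecting a 1-4 scale (5 minus the rating) reverses the sign of every correlation with it and leaves the magnitude unchanged. The essay lengths and ratings below are hypothetical and are not data from that study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical essay lengths in words and ratings on a 1-4 scale (1 = best).
essay_length = [250, 300, 410, 480, 520, 610]
rating_1_is_best = [4, 3, 3, 2, 2, 1]

# Reflect the scale so that a higher number means higher quality.
rating_4_is_best = [5 - rating for rating in rating_1_is_best]

r_raw = pearson_r(essay_length, rating_1_is_best)
r_reflected = pearson_r(essay_length, rating_4_is_best)
print(round(r_raw, 2), round(r_reflected, 2))
```

The two coefficients differ only in sign, so a negative correlation with the raw rating indicates that the variable rises with judged quality.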

In studying the influence of generative rhetoric on the syntactic maturity
and writing effectiveness of 138 freshmen composition students, Faigley (1979)
correlated several syntactical maturity measures with holistic ratings of
quality. Like Nold and Freedman, Faigley reported low correlations between
quality and words per T-unit (r=.04), clauses per T-unit (r=-.07), and words
per clause (r=.18). Also like Nold and Freedman, Faigley reported slightly
higher correlations between quality ratings and length (r=.30), and percentage
of words in final free modifiers (r=.25), although the magnitude of these
correlations is not quite as great as those reported by Nold and Freedman.
Faigley also found a correlation of .41 between quality and percentage of
T-units with final free modifiers, which was the highest correlate he reported
in his study.

Gebhardt (1978) did not report correlations between quality ratings and
the 86 syntactical variables used in her study of the writing of 500
freshmen. Rather, she tried to discover how quality could be measured
quantitatively by determining which variables were significantly different for
33 "poor" and 21 "good" essays. She found that length of essay, mean
subordinate clause length, extensive use of prepositional phrases, and
coordinate conjunctive sentence beginnings were significantly different in the
good and poor essays. T-unit length, on the other hand, was not significantly
different. Martin (1980) found no relationship between T-unit length and
ratings of freshman essays; rather, clause endings, free modifiers, and
percentage of common verbs were significantly related to high quality of
writing.

At the college level, then, the evidence suggests that the relationship 
between commonly used syntactical maturity measures and quality ratings is 
generally weak. 

Research With Students in Grades 2-12. Early developmental research, such
as that of Hunt (1965), Bateman and Zidonis (1966) and O'Donnell, Griffin and
Norris (1967), did not generally concern itself with the relationship between
quality of writing and syntactical measures. As Hunt said about his 1965
landmark study of grammatical structures at three grade levels, "In this study
the word 'maturity' is intended to designate nothing more than 'the observed
characteristics of writers in an older grade.' It has nothing to do with
whether older students write 'better' in any general stylistic sense" (p. 5).

However, in addition to measuring syntactical growth after a particular
course of study (e.g., transformational grammar instruction), some researchers
measured the quality of students' writing as a kind of secondary post-test.
Mellon (1969) and O'Hare (1973) both included quality ratings in their
experimental studies of transformational grammar and sentence combining,
respectively, but neither reported correlations between the two measures.
Mellon (1969) found that judged quality of writing actually decreased among
the experimental groups, while syntactical maturity increased among the
experimental groups who had undergone transformational grammar study. O'Hare
(1973) found both quality of writing and syntactical maturity increased in the
experimental groups who had practiced sentence-combining. Sullivan (1978)
found that sentence combining exercises did enhance syntactic maturity but did
not have an effect on overall quality of writing of eleventh grade students.
Callaghan (1978) reported a similar conclusion for ninth grade students.

Several studies at the elementary and high school levels have more
directly investigated the relationship between various syntactic measures and
quality of writing by use of a contrasted group methodology. Golub and
Fredrick (1970), in their study of the linguistic structures and deviations of
writing of 160 fourth and sixth graders, compared high, middle, and low rated
essays on 63 measures of linguistic structure. They found that many
linguistic variables were significantly different for the high and low rated
essays, but words per T-unit, clauses per T-unit, and words per clause were
not among the significant variables; see also Golub and Fredrick (1971).
Jurgens and Griffin (1970) found little relationship between overall quality
and seven language features in compositions written in grades seven, nine, and
eleven. They, like Golub and Fredrick, did not report correlations between
quality ratings and syntactical measures. Stokes (1979) found no significant
relationship between quality of writing and T-unit length in the writing of
eighth, tenth, and twelfth graders, nor did Evans and Perkins (1979) in their
analysis of fourth, eighth, and eleventh graders in the Oregon Statewide
Writing Assessment.

Veal (1974) studied the relationship between holistic scores and
syntactical measures "as a validity study" for syntactic measures. Although
he did not correlate the measures directly, he found that syntactic measures
clearly distinguished between high and low quality writing in the second,
fourth, and sixth grades. More specifically, he found that words per T-unit
distinguished between high and low rated essays at all three grade levels, but
within some grade levels it failed to distinguish between high and middle
essays, or between low and middle or some other combination other than low vs.
high. Hence, from this study one would infer a significant but weak
relationship between rated quality and T-unit length.

Several studies report a significant relationship between syntactical
variables and overall quality ratings. None report correlations. Chew
(1978), in an analysis of 57 New York Regents essays, found that the papers
with the longest T-units were among those receiving the highest grades.
Dilworth, Reising and Wolfe (1978) found that superior-rated high school
essays contained more words per T-unit, were longer, and exhibited higher
levels of abstraction than lower rated essays. Likewise, DiStefano and
Marzano (1978), in their analysis of 450 NAEP essays, found that T-unit length
was a significant factor for predicting holistic scores for 9 year olds and 13
year olds, but not for 17 year olds.



Crowhurst (1980) suggested that mode of discourse could significantly
influence the relationship between quality and syntactic complexity. She
found that high syntactic complexity was not associated with high quality
ratings if the mode was narration. However, in argumentative writing, high
syntactic complexity was associated with higher quality ratings at grades 10
and 12 but not at grade 6.

At least two studies at the high school or elementary level have directly
correlated essay scores with syntactical measures and both have found similar,
low correlations. Howerton, Jacobson, and Selden (1977) correlated
Composition Evaluation Scale essay scores¹ with words per T-unit and
reported correlations of .17 at grade four, .13 at grade six, .31 at grade
nine, and .18 at grade twelve. These correlations were not as high as those
found between overall length and quality (r=.30 to .54) and between percentage
of total words misspelled and quality (r=-.27 to -.50). The conclusion reached
was that qualitative and quantitative measures are related since their stepwise
multiple regression showed that from 21% to 57% of the variance in
quality ratings can be accounted for using the five variables of total length,
total sentences, percent of unique words written, percent of unique words
misspelled, and words per T-unit. However, only one of the common syntactic
variables, words per T-unit, was used in this study and, as shown, it did not
correlate highly with quality ratings.

¹This score is generated through an analytical scale, in which the essay
is judged on eight factors. However, since the Howerton et al. study did not
correlate each scale score with the syntactical measures, discussion of this
study seems to fit under "holistic vs. syntactic." Although the rating method
was not holistic, a single score was produced to indicate quality.
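Howerton et al. report their regression results as percentages of variance, whereas the computer-scoring studies discussed earlier report multiple correlations. The two are related by R squared = proportion of variance accounted for, so a brief sketch may help in putting the figures on a common footing. The function names are illustrative only.

```python
from math import sqrt


def multiple_r_from_variance(proportion_explained):
    """Multiple correlation R implied by a proportion of variance explained."""
    return sqrt(proportion_explained)


def variance_from_multiple_r(multiple_r):
    """Proportion of variance explained implied by a multiple correlation R."""
    return multiple_r ** 2


# Howerton et al. (1977): 21% to 57% of the variance in quality ratings.
low_r = multiple_r_from_variance(0.21)
high_r = multiple_r_from_variance(0.57)
print(round(low_r, 2), round(high_r, 2))

# Page and Paulus (1968): multiple correlation of .71 with overall quality.
page_variance = variance_from_multiple_r(0.71)
print(round(page_variance, 2))
```

On this footing the Howerton et al. results correspond to multiple correlations of roughly .46 to .75, and the Page and Paulus figure of .71 corresponds to about half of the variance in ratings.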

Stewart and Grobe (1979) investigated the relationship between fifth,
eighth, and eleventh grade students' syntactical maturity and quality ratings
given by trained teachers. In contrast to the Howerton, Jacobson and Selden
study, Stewart and Grobe correlated quality ratings with words per clause and
clauses per T-unit, as well as words per T-unit and some others. They
reported significant correlations between quality of writing and words per
T-unit (r=.30), words per clause (r=.23), and clauses per T-unit (r=.37) at
grade five only. For grades eight and eleven, lower correlations were
reported for words per T-unit vs. quality: r=.19 at grade eight and r=-.06
at grade eleven. The correlations between
quality and words per clause and clauses per T-unit fell into the similarly
low range of -.19 to .20. Stewart and Grobe concluded that no strong
significant relationship exists between holistic scores and any of the three
common measures of syntactic development, except at the grade 5 level. They
also concluded, as others have, that overall length correlates more highly
with quality (r=.36-.47) than do the syntactical measures. Grobe's (1981)
more recent study, a stepwise multiple regression, showed that none of 14
syntactical variables by themselves could accurately predict holistic scores
at grades 5, 8, or 11.

Several studies, then, have established at both the college level and
lower levels that measures of syntactical development seem to bear, at best,
weak relationships to the rated quality of writing.

Objective Tests vs. Syntactical Complexity Measures. In most research on
the relationship between syntactical complexity and writing quality, rated
essays are used as correlates or as criterion measures in the prediction of
quality. Since objective tests of writing skills are widely used to measure
writing and language growth, the relationship between these objective measures
and the major indices of syntactical complexity would seem to be important.
To what extent do T-unit counts, for example, correlate with particular
objective language test or subtest scores? The relationship between syntactic
measures and objective language tests has not been well researched.

Simpson (1974) conducted a canonical and multiple correlation study of
measures of writing of 402 fourth, fifth, and sixth graders. Instead of
attempting to predict quality ratings, Simpson identified significant
predictors of two objective test scores and an essay score, using the language
portion of the Iowa Test of Basic Skills, the Watts Test of Connecting Words
and Phrases, and the Writing Test (an essay test) of the Sequential Test of
Educational Progress. Student writing samples were scored for 56 predictor
measures, including words per T-unit. He found the Myklebust syntax score, a
weighted ratio of errors to words written, to be the most important predictor
of objective test performance with canonical correlations in the neighborhood
of .83 or above. T-unit length alone did not emerge as an important
predictor, leading Simpson to conclude that "attempts to classify children or
evaluate English programs solely on measures of T-unit length and
transformational structures do not account for the major factors of writing
ability."

Ondrasik, Crocker, and Lamme (1979) also completed a canonical correlation
study of the relationship between four objective subtest scores and measures
of writing proficiency. However, neither words per T-unit nor any of the
other common syntactical indices were used in the analysis. Rather, total
number of T-units was used as a variable. Low correlations of .17 and -.02
were reported between number of T-units and performance on the objective
subtests.

No other studies comparing performance on objective language tests with
syntactical complexity of writing samples could be found.

Computer vs. Analytic. Page and Paulus (1968), in addition to their work
on computer prediction of holistic essay scores, also examined the
relationship of their thirty proxes to five analytical ratings. The
analytical scale included separate ratings of essays for creativity, ideas,
style, mechanics, and organization. The correlations were all in the moderate
to high range: creativity (r=.78), ideas (r=.78), style (r=.77), mechanics
(r=.6[?]) and organization (r=.69). The surprising finding that a composite of
the 30 proxes was correlated most highly with creativity ratings seems to be
accounted for in large measure by the contribution of the "essay length"
prox. For the average of all five traits vs. the thirty proxes, Page and
Paulus report a multiple correlation of .72, similar to the multiple
correlation found between the holistic scores and the proxes. Those proxes
contributing the most to the prediction of the average of the five traits were
length (r=-.26), commas (r=.38), dashes (r=.32), standard deviation of word
length (r=.45) and spelling errors (r=-.19).

Syntactic vs. Computer. Golub and Kidder (1974) have developed the
Syntactical Density Score (SDS) which uses computer analysis of essays to
produce a measure of syntactic maturity; see also Golub (1974). The SDS was
designed by selecting the best 10 of 63 variables that attempted to predict
quality of writing in Golub and Fredrick's 1970 and 1971 studies, discussed
elsewhere in this paper. The ten variables are: 1) words per T-unit; 2)
subordinate clauses per T-unit; 3) main clause word length; 4) subordinate
clause word length; 5) number of modals; 6) number of be and have forms in the
auxiliary position; 7) number of prepositional phrases; 8) number of
possessives; 9) number of adverbs of time; 10) number of gerunds, participles,
and absolute phrases (unbound modifiers).

The computer program makes "decisions" about the syntactic structures that
"probably" exist due to the pattern of punctuation in the essay. Kidder and
Golub (1974) report a correlation of .96 between computer generated and hand
tabulated scores for the syntactic features.

Analytic vs. Objective Test. Few studies compare performance on objective
tests with analytically rated characteristics of student papers. Usually, the
overall score produced through analytical rating would be correlated with some
criterion (e.g., Howerton et al., 1977) but rarely are ratings on particular
traits correlated with a criterion such as an objective test score.

DISCUSSION AND GENERALIZATIONS

The research on relationships among the various measures of writing skill
admits of relatively few well-established generalizations. Nonetheless, in
this final section we attempt to formulate a number of conclusions, identify
major questions yet to be answered, and discuss some other problems relevant
to the measurement of writing skill.

1. The relationship between holistic ratings of essays and objective
test scores has been fairly well established. Correlations between the
two types of measures are generally about .60. If this figure is
corrected for unreliability in the objective test and in the scoring of
the essay, the r increases to about .70, but if the correction is made to
include the alternate form or cross-task reliability of the essay, the
corrected r would be in the neighborhood of .80 or better.

Recent research on the relationship between holistic scores and 
objective tests differs little either in its methodology or conclusions 
from that summarized by Huddleston (1954).. It might also be noted that 
there has been no abatement over the years in the disbelief in, even 
outright rejection of, these findings. 
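
The correction for unreliability described in item 1 is conventionally
Spearman's correction for attenuation: the observed r divided by the square
root of the product of the two reliabilities. The sketch below applies that
formula to the observed figure of .60. The three reliability values are
assumptions chosen only so that the arithmetic lands near the .70 and .80
figures quoted above; they are not taken from any of the studies reviewed.

```python
import math


def disattenuate(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Spearman's correction for attenuation: r_xy / sqrt(rel_x * rel_y)."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


observed_r = 0.60  # typical holistic-vs-objective correlation cited in item 1

# Hypothetical reliabilities, for illustration only (not from the literature).
objective_rel = 0.90   # assumed reliability of the objective test
scorer_rel = 0.82      # assumed inter-scorer reliability of the essay rating
cross_task_rel = 0.62  # assumed alternate-form / cross-task essay reliability

# Correcting for scorer unreliability only.
r_scorer = disattenuate(observed_r, objective_rel, scorer_rel)
# Correcting for cross-task unreliability of the essay instead.
r_task = disattenuate(observed_r, objective_rel, cross_task_rel)

print(round(r_scorer, 2), round(r_task, 2))
```

Because cross-task reliability is ordinarily lower than scorer reliability,
dividing by it produces the larger corrected value, which is why the second
correction in item 1 yields the higher estimate.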

2. Although the research on scorer reliability is now quite clear, 
i.e. essays can be scored quite reliably, the reliability of essays across 
occasions or types of tasks has not been thoroughly documented. Evidence 
available on this latter issue, although meager, suggests the presence of 
a disconcerting amount of unreliable variance across occasions and tasks; 
and this problem would seem to beset all of the methods which depend upon 
a writing sample, i.e. all methods except objective tests. 

3. While analytic scales are invariably listed among the various 
methods of measuring writing skill, they are used very little in the 
formal research literature (and perhaps anywhere else, too). The bits of 
evidence which we do have about scores derived from analytical scales 
suggest that they behave very much like holistic scores, both in terms of 
the subscores and, even more so, in terms of the frequently used total
score obtained from analytical devices. In other words, the subscales
contain little unique variance, certainly far less than the originators
and proponents of analytic scales suppose. Hence, for practical purposes,
it is probably safe to assume that any generalizations developed for
holistic scores will hold true for analytic scales, too.

4. Various syntactical measures bear little relationship to holistic 
ratings of quality of writing (and, therefore, presumably to analytical 
ratings) or to objective test scores. The relationships tend to be 
negligible or, if significant at all, very weak. One does begin to wonder
what the syntactic indices are measuring. To be sure, some authors state
quite clearly that syntactic indices are intended simply to describe
language, not to measure its quality. But it is important to note that
the syntactic indices are often used in practice to recommend continuation 
(or discontinuation) of instructional strategies and programs which 
apparently are designed to improve the quality of writing. 

5. Computer-generated scores (weighted composites of
computer-countable features of a written work) yield surprisingly high
correlations with the quality of writing, as defined by holistic scores.
The correlations are generally in the range of .60-.70. Even some of the
individual computer-counted features, such as length of essay, mean and
standard deviation of word length, and indices of vocabulary load,
consistently yield significant though moderate correlations with rated
quality. Strangely, however, no research on computer-generated scores has
been published since the spurt of activity with this method in the late
'60's and early '70's.
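
To make concrete what a "weighted composite of computer-countable features"
might look like, the following minimal sketch counts three of the features
named in item 5 (essay length, mean word length, and standard deviation of
word length) and combines them linearly. The function names, the tokenizing
rule, and especially the weights are hypothetical; in practice weights would
have to be estimated from data, for example against holistic ratings.

```python
import re
import statistics


def countable_features(essay: str) -> dict:
    """Count a few surface features of an essay (illustrative choice only)."""
    # Assumed tokenizing rule: runs of letters, with an optional apostrophe part.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", essay)
    if not words:
        raise ValueError("essay contains no words")
    lengths = [len(w) for w in words]
    return {
        "essay_length": len(words),                     # number of words
        "mean_word_length": statistics.mean(lengths),   # mean letters per word
        "sd_word_length": statistics.pstdev(lengths),   # population SD
    }


def composite_score(features: dict, weights: dict) -> float:
    """Weighted sum of the features; the weights are supplied by the caller."""
    return sum(weights[name] * value for name, value in features.items())


# Placeholder weights, NOT estimated from any data.
PLACEHOLDER_WEIGHTS = {
    "essay_length": 0.01,
    "mean_word_length": 0.5,
    "sd_word_length": 0.3,
}

sample = "The cat sat on the mat."
features = countable_features(sample)
score = composite_score(features, PLACEHOLDER_WEIGHTS)
print(features, round(score, 3))
```

Nothing in this sketch judges quality directly; it only tallies surface
features, which is what makes the correlations reported above surprising.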

6. It seems odd that the two latter generalizations (#'s 4 and 5) 
could be simultaneously true, since computer analysis and syntactic 
analysis seem to have so much in common. Sometimes it almost seems as if 
the syntactic analysis is too sophisticated, laying ever more complexly
and obscurely defined indices on top of one another, thereby missing what
are perhaps some rather simple, direct qualities of good writing. To be
sure, that explanation, if not downright philistine, is at least not very
helpful. Or, it may be that the success of the computer-generated score
lies mainly in its reliance on combining several variables, each of which
has rather limited reliability, whereas the syntactic indices, each with 
rather limited reliability, usually stand alone. In any case, this 
question seems to beg for further analysis. 
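
The second explanation in item 6 can be illustrated with the Spearman-Brown
formula, which gives the reliability of a sum of k components when each
component has reliability rho. This is only a sketch: it assumes the
components are parallel measures of the same thing, which computer-counted
features need not be, and the value of .30 is a made-up figure.

```python
def composite_reliability(k: int, rho: float) -> float:
    """Spearman-Brown: reliability of a sum of k parallel components."""
    if k < 1 or not (0 <= rho <= 1):
        raise ValueError("need k >= 1 and 0 <= rho <= 1")
    return k * rho / (1 + (k - 1) * rho)


# Hypothetical per-variable reliability, for illustration only.
single = composite_reliability(1, 0.30)     # one index standing alone
combined = composite_reliability(10, 0.30)  # ten such variables combined

print(round(single, 2), round(combined, 2))
```

Under these assumptions a composite of ten weak variables is far more
reliable than any one of them, which would favor a combined score over a
syntactic index that stands alone.
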

7. Primary trait scoring has been the subject of virtually no
published research. It hardly seems appropriate to foist this method upon
the world at this time, although evidently it is being peddled across the
country in an almost cavalier fashion. We know practically nothing about
the measurement characteristics of the primary trait method of scoring:
its reliability as defined in the usual variety of ways, its relationship
to other measures, its relationship to external criteria, etc. "Because it
seems like a good idea" hardly seems like an adequate basis for widespread,
routine use of the technique, at least if we pay any respects at all to
fundamental notions of good measurement practice. All of this is not to
say that primary trait scoring is not a good measurement technique. It is
only to say that at the present time we don't know very much about its
measurement characteristics and, therefore, ought to confine its use to
restricted research applications.

8. There are a number of issues lurking in the literature on writing 
assessment which fairly cry out for empirical analysis. Without
pretending to draw up an exhaustive list of these, we offer the following
three topics as being high priority items in any research agenda. The
first, which has already been mentioned, is the cross-task
generalizability of the various types of scores derived from writing
samples. There is a widespread feeling that different types of tasks, as
defined, for example, by the traditional "modes of discourse"
(argumentative, narrative, etc.), yield different results. Indeed, there
is now good evidence that certain features of writing differ from one of 
these types of tasks to another. But these are average differences and 
may not affect relative order of performance; that is, the differences 
discovered to date may be nothing more than scale transformations. We 
simply don't know. 

A second issue relates to the length of writing sample required
for analysis. One finds rather strongly propounded opinions on this
point, with recommendations ranging from 20 minutes to two hours.
However, there appears to be no empirical evidence on this issue.

Finally, while it is generally accepted that training of raters 
is an important prerequisite for use of scoring methods which depend
heavily on human judgment, there seems to be no evidence regarding how
much training is enough. In many practical applications, training may be
rather lengthy. In other instances, training is brief in the extreme,
consisting of reading a page of instructions and having a 5-minute
discussion. Our suspicion is that some of the more elaborately designed
training sessions are more fluff than substance, intended more for public 
relations than reliability. However, the issue is empirically resolvable 
and really should be addressed by a number of studies. 
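
On the first of the three topics in item 8, the remark that average
differences between tasks "may be nothing more than scale transformations"
can be checked numerically: a positive linear transformation of one set of
scores lowers its mean but leaves its correlation with any other measure
unchanged. The scores below are invented for illustration and do not come
from any study.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: six students' objective test scores and essay ratings.
objective = [42, 55, 61, 48, 70, 66]
narrative = [3.0, 4.5, 5.0, 3.5, 6.0, 5.5]

# Suppose the argumentative task yields lower ratings on average, but only
# through a rescaling of the same underlying ordering (an assumption).
argumentative = [0.8 * s - 1.0 for s in narrative]

r_narr = pearson_r(objective, narrative)
r_arg = pearson_r(objective, argumentative)
print(round(r_narr, 3), round(r_arg, 3))  # the two values coincide
```

If real tasks differed only in this way, mean differences between modes of
discourse would say nothing about cross-task generalizability; whether they
do is the empirical question raised above.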